Toward a statistical mechanics of four letter words

We consider words as a network of interacting letters, and approximate the probability distribution
of states taken on by this network. Despite the intuition that the rules of English spelling are highly 
combinatorial (and arbitrary), we find that maximum entropy models consistent with pairwise 
correlations among letters provide a surprisingly good approximation to the full statistics of four 
letter words, capturing ~ 92% of the multi-information among letters and even 'discovering' real 
words that were not represented in the data from which the pairwise correlations were estimated. 
The maximum entropy model defines an energy landscape on the space of possible words, and local 
minima in this landscape account for nearly two-thirds of words used in written English. 



Many complex systems convey an impression of order that is not so easily captured by the
traditional tools of theoretical physics. Thus, it is not clear what sort of order parameter or
correlation function we should compute to detect that natural images are composed of solid
objects [1], nor is it obvious what features of the amino acid sequence distinguish foldable
proteins from random polymers. Recently, several groups have tried to simplify the problem of
characterizing order in biological systems using the classical idea of maximum entropy [2].
Maximum entropy models consistent with pairwise correlations among neurons have proven
surprisingly effective in describing the patterns of activity in real networks ranging from the
retina [3, 4] to the cortex [5]; these models are identical to the Ising models of statistical
mechanics, which have long been explored as abstract models for neural networks [6]. Similar
methods have been used to analyze biochemical [7] and genetic [8] networks, and these approaches
are connected to an independent stream of work arguing that pairwise correlations among amino
acids may be sufficient to define functional proteins [9]. Because of the immediate connection to
statistical mechanics, this work also provides a natural path for extrapolating to the collective
behavior of large networks, starting with real data [10]. Here we test the limits of these ideas
by constructing maximum entropy models for the sequence of letters in words.

As non-native speakers know well, the rules of English spelling seem arbitrary and almost
paradigmatically combinatorial (i before e except after c). In contrast, the whole point of
maximum entropy constructions based on pairwise correlations is to ignore such higher order,
combinatorial effects [11]. We thus suspect that the statistics of letters in words will provide
an interesting test case. There is a long history of statistical approaches in the analysis of
language, including applications of maximum entropy ideas [12], while opposition to such
statistical approaches was at the foundation of forty years of linguistic theory [13, 14]; for a
recent view of these debates see Ref. [15]. Our goal here is not to enter into these
controversies about language in the broad sense, but rather to test the power of pairwise
interactions to capture seemingly complex structure.

Even with only four letters, there are $N = (26)^4 = 456{,}976$ possible words, but only a tiny
fraction of these are real (or even 'legal') words in English. Our problem thus is easy to state:
are maximum entropy models powerful enough to capture this restriction of vocabulary?

To analyze the interactions among letters, we use two large corpora: a collection of novels from
Jane Austen [16], and a large sampling of American English contained in the American National
Corpus (ANC) [17]. To control for potential typographic errors, words were also checked against a
large dictionary database [18]. Out of 676,302 total words in our Austen corpus there were 7114
unique words, 763 of which were four letter words; the four letter words occurred in the corpus a
total of 135,441 times.

TABLE I: Four letter words in two large corpora, the novels of Jane Austen [16] and the American
National Corpus [17].

FIG. 1: (a) The six pairwise marginal distributions of four-letter words sampled from the Jane
Austen corpus. Common letter pairs such as "th" in $P_{12}$ are apparent in their large marginal
probability. (b) The iterative scaling algorithm solves the constrained maximization problem to
high precision. All pairwise marginal components of the full distribution are compared to the
marginals constructed from the computed maximum entropy distribution.



We used the second release of the ANC ($\sim 2 \times 10^7$ words), and restricted ourselves to
words used more than 100 times, providing 798 unique four letter words occurring 2,179,108 times.
These numbers indicate that we can sample the distribution of four letter words with reasonable
confidence. The most common words and their probabilities are shown in Table I.
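The counting step described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `four_letter_counts` and `to_probabilities`, the regular-expression tokenizer, and the toy sentence are all assumptions, and the dictionary check against Ref. [18] is not reproduced.

```python
import re
from collections import Counter

def four_letter_counts(text, min_count=1):
    """Count four-letter words in raw text, keeping words seen at least min_count times.

    Tokenization is an assumption: runs of a-z after lowercasing, so
    apostrophes and hyphens split words.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) == 4)
    return {w: c for w, c in counts.items() if c >= min_count}

def to_probabilities(counts):
    """Normalize raw counts into an empirical probability for each word."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Toy text standing in for a corpus (hypothetical, for illustration only).
sample = "That they were very glad; that much they said. They left."
counts = four_letter_counts(sample)
probs = to_probabilities(counts)
```

Under these assumptions, keeping only words used more than 100 times, as was done for the ANC, would correspond to `min_count=101`.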

We are interested in the full joint distribution $P(\ell_1, \ell_2, \ell_3, \ell_4)$ of letters
in a four letter word. The maximum possible entropy of this distribution is
$S_{\rm rand} = 4\log_2(26) = 18.802$ bits. Letters occur with different probabilities, however,
so even if we compose words by choosing letters independently and at random out of real text, the
entropy will be lower than this. A more precisely defined 'independent model' is the
approximation

$$
P^{(1)}(\ell_1, \ell_2, \ell_3, \ell_4) = P_1(\ell_1)\, P_2(\ell_2)\, P_3(\ell_3)\, P_4(\ell_4), \tag{1}
$$

where we note that each of the $P_i(\ell)$ is different because letters are used differently at
different positions in the word. In the Austen corpus, this independent model has an entropy
$S_{\rm ind} = 14.083 \pm 0.001$ bits, while the full distribution has entropy of just
$S_{\rm full} = 6.92 \pm 0.003$ bits [19]. The difference between these quantities is the
multi-information, $I = S_{\rm ind} - S_{\rm full} = 7.163$ bits, which measures the amount of
structure or correlation in the joint distribution of letters that form real words. Thus,
spelling rules restrict the vocabulary by a factor of $2^{I} \sim 143$ relative to the number of
words that would be allowed if letters were chosen independently.
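As a sketch of how these entropies could be computed from a table of word probabilities, the snippet below uses naive plug-in estimates. The helper names and the three-word toy distribution are assumptions for illustration, and the finite-sample treatment behind the quoted error bars [19] is not reproduced.

```python
import math
from collections import Counter

def entropy_bits(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def multi_information(word_probs):
    """Return (S_ind, S_full, I) for a dict mapping equal-length words to probabilities.

    S_ind is the entropy of the independent model, which for a product
    distribution is the sum of the entropies of the position-wise marginals.
    """
    length = len(next(iter(word_probs)))
    s_full = entropy_bits(word_probs.values())
    s_ind = 0.0
    for pos in range(length):
        marginal = Counter()
        for word, p in word_probs.items():
            marginal[word[pos]] += p
        s_ind += entropy_bits(marginal.values())
    return s_ind, s_full, s_ind - s_full

s_rand = 4 * math.log2(26)  # entropy of uniformly random four letter strings

# Hypothetical toy distribution, not corpus data.
toy = {"that": 0.5, "this": 0.25, "them": 0.25}
s_ind, s_full, info = multi_information(toy)
```

In the toy case the first two letters are fixed and the last two are perfectly correlated, so the independent model has 3 bits of entropy while the full distribution has 1.5 bits.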

Maximum entropy models based on pairwise correlations are equivalent to Boltzmann distributions
with pairwise interactions among the elements of the system (see, for example, Ref. [11]); in our
case this means approximating
$P(\ell_1, \ell_2, \ell_3, \ell_4) \approx P^{(2)}(\ell_1, \ell_2, \ell_3, \ell_4)$,

$$
P^{(2)}(\ell_1, \ell_2, \ell_3, \ell_4) = \frac{1}{Z} \exp\left[ -\sum_{i>j} V_{ij}(\ell_i, \ell_j) \right], \tag{2}
$$

where the $V_{ij}$ are 'interaction potentials' between pairs of letters and $Z$ serves to
normalize the distribution; because the order of letters matters, there are six independent
potentials. Each potential is a $26 \times 26$ matrix, but the zero of energy is arbitrary, so
this model has $6 \times (26^2 - 1) = 4050$ parameters, more than $100\times$ fewer than the
number of possible states. Note that interactions extend across the full length of the word, so
that the maximum entropy model built from pairwise correlations is very different from a Markov
model which only allows each letter to interact with its neighbor.
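A minimal sketch of a Boltzmann distribution with pairwise interactions, written with NumPy, may make the structure concrete. The function name `pairwise_model`, the random potentials, and the three-letter toy alphabet are assumptions for illustration; pairs are indexed with i < j rather than i > j, which labels the same six pairs.

```python
import itertools
import numpy as np

def pairwise_model(V, q):
    """Build P(l1, l2, l3, l4) proportional to exp(-sum of V_ij(l_i, l_j)).

    V maps a position pair (i, j) with i < j to a (q, q) array whose first
    axis indexes the letter at position i and second axis the letter at j.
    """
    energy = np.zeros((q, q, q, q))
    for (i, j), v in V.items():
        shape = [1, 1, 1, 1]
        shape[i] = q
        shape[j] = q
        energy += v.reshape(shape)  # broadcast V_ij over the other two positions
    weights = np.exp(-energy)
    return weights / weights.sum()  # division by Z normalizes the distribution

pairs = list(itertools.combinations(range(4), 2))  # the six position pairs
n_params = len(pairs) * (26**2 - 1)  # parameter count quoted in the text
n_states = 26**4                     # number of possible four letter strings

# Toy example with a 3-letter alphabet and random potentials (hypothetical).
rng = np.random.default_rng(0)
q = 3
V = {pair: rng.normal(size=(q, q)) for pair in pairs}
P = pairwise_model(V, q)
```

For the real alphabet the same array has 26^4 entries, which is small enough to hold in memory, so exact normalization is possible without sampling.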

We determine the interaction potentials $V_{ij}$ by matching to the pairwise marginal
distributions, that is by solving the six coupled sets of $26^2$ equations:

$$
P_{12}(\ell, \ell') \equiv \sum_{\ell_3, \ell_4} P(\ell, \ell', \ell_3, \ell_4) = \sum_{\ell_3, \ell_4} P^{(2)}(\ell, \ell', \ell_3, \ell_4), \tag{3}
$$

and similarly for the other five pairs. As shown in Fig. 1(a), the pairwise marginals sampled
from English are highly structured; many entries in the marginal distributions are exactly zero,
even in corpora with millions of words.

Construction of maximum entropy models for large systems is difficult [20], but for
$\sim 5 \times 10^5$ states as in our problem relatively simple algorithms suffice [21]. We see
from Fig. 1(b) that these methods succeed in matching the observed pairwise marginals with high
precision.
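One simple algorithm of this kind is iterative proportional fitting, which rescales the model so that each pairwise marginal in turn matches its target. The sketch below is an illustration under that assumption; whether it coincides with the exact iterative scaling variant used for the figures is not established by the text, and the function names and toy data are hypothetical.

```python
import itertools
import numpy as np

def pair_marginal(P, i, j):
    """Marginal of the joint array P over positions i < j (axes ordered i, j)."""
    other = tuple(k for k in range(P.ndim) if k not in (i, j))
    return P.sum(axis=other)

def fit_pairwise_maxent(P_data, n_sweeps=200):
    """Fit the pairwise maximum entropy model to P_data by iterative proportional fitting."""
    q = P_data.shape[0]
    pairs = list(itertools.combinations(range(P_data.ndim), 2))
    targets = {pair: pair_marginal(P_data, *pair) for pair in pairs}
    P = np.full(P_data.shape, 1.0 / P_data.size)  # start from the uniform distribution
    for _ in range(n_sweeps):
        for (i, j) in pairs:
            current = pair_marginal(P, i, j)
            # Rescaling factor target/current; entries with zero current mass stay zero.
            ratio = np.divide(targets[(i, j)], current,
                              out=np.zeros_like(current), where=current > 0)
            shape = [1] * P.ndim
            shape[i] = q
            shape[j] = q
            P = P * ratio.reshape(shape)
    return P

# Toy target on a 3-letter alphabet (hypothetical data, for illustration only).
rng = np.random.default_rng(1)
P_data = rng.random((3, 3, 3, 3))
P_data /= P_data.sum()
P_fit = fit_pairwise_maxent(P_data)
max_err = max(
    np.abs(pair_marginal(P_fit, i, j) - pair_marginal(P_data, i, j)).max()
    for i, j in itertools.combinations(range(4), 2)
)
```

Letter pairs that never occur in the data have a target marginal of exactly zero; in this sketch the corresponding states are assigned zero probability, which corresponds to an infinite interaction potential.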

As shown at left in Fig. 2, the maximum entropy model with pairwise interactions does a
surprisingly good job in capturing the structure of the full distribution. In the Austen corpus
the model predicts an entropy $S_2 = 7.48$ bits, which means that it captures 92% of the
multi-information, and similar results are found with the ANC, where we capture 89% of the
multi-information. Pairwise interactions thus restrict the vocabulary by a factor of
$2^{S_{\rm ind} - S_2} \sim 100$ relative to the words which are possible by choosing letters
independently.
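The 92% figure follows directly from the entropies quoted for the Austen corpus. As a worked check, where the symbol $I_2$ for the multi-information captured by the pairwise model is shorthand introduced here rather than notation from the text:

```latex
\frac{I_2}{I} = \frac{S_{\rm ind} - S_2}{S_{\rm ind} - S_{\rm full}}
  = \frac{14.083 - 7.48}{14.083 - 6.92}
  = \frac{6.603}{7.163} \approx 0.92,
\qquad
2^{S_{\rm ind} - S_2} = 2^{6.603} \approx 97 \sim 100 .
```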

While very good, the maximum entropy model is of course not perfect, as we can see at right in
Fig. 2. Here we see that the probabilities of the individual words predicted by the model agree
only approximately with the observed probabilities. There is good agreement on average,
especially for the more common words, but substantial scatter. On the other hand, there are
particular words with low probability, whose frequency of use is predicted with high accuracy. We
have singled out two of these, 'edge' and 'itch,' as somewhat surprising. Each contains sounds
composed of three consonants in sequence, and we might have expected that to predict the
frequency of these combinations one would need to incorporate three-body interactions among the
letters, but it seems that pairwise interactions are sufficient.

FIG. 2: (left) The pairwise maximum entropy model provides an excellent approximation to the full
distribution of four-letter words, capturing 92% of the multi-information. (right, dots) Scatter
plot of the four letter word probabilities in the full distribution $P_{\rm sampled}$ vs. the
corresponding probabilities in the maximum entropy distribution $P^{(2)}$. (right, red crosses)
To facilitate the comparison we divided the full probability into 20 equally log-spaced bins and
computed the mean maximum entropy probability conditioned on the states in the full distribution
within each bin. The dashed line marks the identity. (right, blue circles) Even for small
probabilities, there are still words such as 'edge' and 'itch' whose states are well-captured by
the pairwise model.

FIG. 3: The Zipf plot for all words in the corpus (black line), four letter words in the corpus
(blue crosses), and four letter words in the maximum entropy model (red crosses). Green circles
denote 'non-words', states in the maximum entropy model that didn't appear in the corpus. The 25
most likely 'non-words' are shown in the text inset (ordered in decreasing probability from left
to right and top to bottom). Some of these are recognizable as real words that just did not
appear in the corpus, and even the others have plausible spelling.

Another way of looking at structure in the distribution of words is the Zipf plot [22], the
probability of a word's use as a function of its rank (Fig. 3). If we look at all words, we
recover the approximate power law which initially intrigued Zipf; when we confine our attention
to four letter words, the long tail is cut off [33]. The maximum entropy model does a good job of
reproducing the observed Zipf plot, but removes some weight from the bulk of the distribution and
reassigns it to words which do not occur in the corpus, repopulating the tail. Importantly, as
shown in the inset to Fig. 3, many of these 'non-words' are perfectly good English words which
happen not to have been used by Jane Austen. Quantitatively, the maximum entropy model assigns
15% of the probability to words which do not appear in the corpus, but of these 1/5 are words
that can be found in the dictionary, and the same factor for the 'correct discovery' of new words
is found with the ANC. Although somewhat subjective, we note that even words not found in the
dictionary seem to be speakable combinations of letters, not obviously violating known spelling
rules.
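The bookkeeping behind the Zipf plot and the 'non-word' statistics can be sketched as below. The function names and the toy probabilities are assumptions for illustration. Because the text does not say whether the 1/5 figure is weighted by probability or by count, the sketch reports both.

```python
def zipf_curve(word_probs):
    """Return (rank, probability) pairs ordered by decreasing probability."""
    ranked = sorted(word_probs.values(), reverse=True)
    return list(enumerate(ranked, start=1))

def non_word_summary(model_probs, corpus_words, dictionary_words):
    """Summarize model probability assigned to strings absent from the corpus.

    Returns (mass, share_by_mass, share_by_count), where the shares are the
    fractions of non-words that do appear in the dictionary.
    """
    non_words = {w: p for w, p in model_probs.items() if w not in corpus_words}
    mass = sum(non_words.values())
    found = {w: p for w, p in non_words.items() if w in dictionary_words}
    share_by_mass = sum(found.values()) / mass if mass > 0 else 0.0
    share_by_count = len(found) / len(non_words) if non_words else 0.0
    return mass, share_by_mass, share_by_count

# Hypothetical toy model output, not values from either corpus.
model_probs = {"that": 0.5, "they": 0.3, "thew": 0.1, "thzt": 0.06, "whey": 0.04}
corpus_words = {"that", "they"}
dictionary_words = {"that", "they", "thew", "whey"}

curve = zipf_curve(model_probs)
mass, share_by_mass, share_by_count = non_word_summary(
    model_probs, corpus_words, dictionary_words
)
```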

If we take Eq. (2) seriously as a statistical mechanics problem, then we have constructed an
energy landscape on the space of words, much in the spirit of Hopfield's energy landscape for the
states of neural networks [6]. In this landscape there are local minima, combinations of letters
for which any single letter change will result in an increase in the energy. In the maximum
entropy model for the ANC, there are 136 of these local minima, of which 118 are real English
words, capturing nearly two thirds (63.5%) of the full distribution; similar results are obtained
in the Austen corpus. We note that such 'stable words' have the property that any single letter
spelling error can always be corrected by relaxing to the nearest local minimum of the energy. It
is tempting to suggest that if we could construct the energy landscape for sentences (rather than
for words), then almost all legal sentences would be locally stable.
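A sketch of how local minima and the relaxation of a misspelling could be computed on an energy array is given below. The function names are hypothetical. The toy landscape is not derived from any corpus: it is the Hamming distance to the nearer of two 'stable words' on a two-letter alphabet.

```python
import numpy as np

def local_minima_mask(energy):
    """Boolean mask of states whose energy rises under every single-letter change.

    A state qualifies if, along each axis, it is the strict minimum of the
    line obtained by varying only that position.
    """
    mask = np.ones(energy.shape, dtype=bool)
    for axis in range(energy.ndim):
        two_smallest = np.partition(energy, 1, axis=axis)
        lowest = np.take(two_smallest, [0], axis=axis)
        second = np.take(two_smallest, [1], axis=axis)
        mask &= (energy == lowest) & (lowest < second)
    return mask

def relax(state, energy):
    """Steepest descent over single-letter changes until no change lowers the energy."""
    state = tuple(state)
    while True:
        best = state
        for pos in range(energy.ndim):
            for letter in range(energy.shape[pos]):
                candidate = state[:pos] + (letter,) + state[pos + 1:]
                if energy[candidate] < energy[best]:
                    best = candidate
        if best == state:
            return state
        state = best

# Toy landscape: energy is the distance to the nearer of 0000 and 1111.
ones = np.indices((2, 2, 2, 2)).sum(axis=0)
energy = np.minimum(ones, 4 - ones).astype(float)

minima = [tuple(int(x) for x in s) for s in np.argwhere(local_minima_mask(energy))]
corrected = relax((1, 0, 0, 0), energy)
```

For a fitted model one would take the energy to be the negative logarithm of the model probability, which differs from the sum of potentials only by the constant log Z and so has the same minima.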

To summarize, in English we use only a very small fraction (~ 1/3700) of the roughly half million
possible four letter combinations. The hierarchy of maximum entropy constructions [11] allows us
to decompose these spelling rules into contributions from interactions of different order. A
significant factor (~ 1/26) comes from the unequal probabilities with which individual letters
are used, a larger factor (~ 1/100) comes from the pairwise interactions among letters, and
higher order interactions contribute only a small factor (~ 1/1.5). The pairwise model represents
an enormous simplification, which nonetheless has the power to capture most of the structure in
the distribution of letters and even to discover combinations of letters that are legal but
unused in the corpora from which we have learned. The analogy to statistical mechanics also
invites us to think about the way in which combinations of competing interactions enforce a
complex landscape, singling out words which can be transmitted stably even in the presence of
errors. Although our primary interest has been to test the power of the maximum entropy models,
these ideas of generalization and error correction seem relevant to understanding the cognitive
processing of text [15, 23].
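The decomposition in the summary can be checked from the Austen corpus entropies quoted earlier, since each factor is two raised to an entropy difference. The variable names below are assumptions for illustration.

```python
import math

# Entropies in bits, as quoted in the text for the Austen corpus.
S_rand = 4 * math.log2(26)  # uniformly random letters
S_ind = 14.083              # independent model
S_2 = 7.48                  # pairwise maximum entropy model
S_full = 6.92               # full distribution of four letter words

factor_single = 2 ** (S_rand - S_ind)  # unequal single-letter usage, roughly 26
factor_pair = 2 ** (S_ind - S_2)       # pairwise interactions, roughly 100
factor_higher = 2 ** (S_2 - S_full)    # higher order interactions, roughly 1.5
factor_total = 2 ** (S_rand - S_full)  # overall restriction, roughly 3700

# The three contributions multiply to give the overall restriction.
product = factor_single * factor_pair * factor_higher
```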